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ABSTRACT 

This paper focuses on utihzing two different Bayesian methods to deal with a variety of 
toy problems which occur in data analysis. In particular we implement the Variational 
Bayesian and Nested Sampling methods to tackle the problems of polynomial selection 
and Gaussian Mixture Models, comparing the algorithms in terms of processing speed 
and accuracy. In the problems tackled here it is the Variational Bayesian algorithms 
which are the faster though both results give similar results. 
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1 INTRODUCTION 

Parameter estimation within the Bayesian framework rests 
on the application of Bayes theorem to data analysis. If we 
consider a parameter set 6, a data set D and all prior knowl- 
edge I Bayes theorem tells us that: 



P(6>|D,I) 



L(D|6>,I)P(6>|I) 
P(D|I) 



(1) 



Here we have: 



• The Posterior Probability P(^|D,I) which represents 
our belief in the hypotheses once we have analysed the data 
available. 

• The Prior Probability P(^|II) which encodes our previ- 
ous knowledge of the system under examination. 

• The Likelihood Function L(D|^,I) which is the proba- 
bility of observing the data for a given set of parameters. 

• The Evidence P(D|I) which is a normalisation constant 
given by the probability of the data. 

In the specific case when the problem being faced is one of 
parameter estimation we can ignore the denominator which 
is independent of 0. We thus get: 



P(6'|D,I) oc L(D|6',I)P(6'|I) 



(2) 



Here the prior knowledge is being transformed by the data 
through the likelihood to give the posterior distribution 
which embodies our new beliefs. 

Bayesian methodology can also be applied to model se- 
lection scenarios to choose between competing hypotheses 
or models Mi. Given a model M we can write down, using 
Bayes theorem, the probability that M is the correct model 
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given the data D and our previous knowledge I: 



(3) 



This is analogous to Equation Q which deals with parame- 
ter estimation: just as the posterior probability distribution 
function for a parameter is proportional to its prior times 
its likelihood, so the posterior probability for a model as a 
whole is proportional to its prior probability times its Evi- 
dence. P(D|I) itself cannot be calculated but model selection 
can still be performed by comparing the probability of two 
models Mi and M2 using a ratio, called the posterior ratio 
or odd's ratio: 

P{Mi\DJ) _ P(Mi|/) P{D\Mi,I) 



P{M2\DJ) P{M2\I) P(P>|,M2,/) ^"^^ 

When we have no prior preference for one model over an- 
other Equation Q reduces to a ratio of Evidences. Marginil- 
isation and Bayes theorem allow us to express the Evidence 
as an integral over the parameter space: 



P(D|M, I) 



P{e\M, I)L(D|6',M, l)dO 



(5) 



Calculating this integral then allows the experimentalist to 
select the optimum model to describe a data set. We now 
discuss two different techniques which lead to a value the 
Evidence and further on compare them in terms of compu- 
tational speed and accuracy. 



2 NESTED SAMPLING 

Nested sampling (Skilling, 2006) is a modern Bayesian tech- 
nique which transforms the multidimensional Evidence in- 
tegral of Equation ([5| into a simpler one dimensional one. 
This is done by considering the prior mass x ^"^^ i^s con- 
stituent elements dx — P(^|II)<i^. These can be summed up 
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Figure 2. We see the separate steps of the Nested Samphng 
algorithm: (a) On entry we have n objects with L > L* or x < X* • 
(b) The one with largest x and call it x* , removing it from our list 
which now contains n — 1 objects sampled from the Prior, (c) We 
generate a new object, sampled from the Prior once again, but 
this time constrained to lie within the new Likelihood domain. 



Figure 1. The above diagram, taken from (Skilling, 2006) shows 
how points in the parameter space are sampled such that they 
satisfy the Likelihood constraint L > L* . 

as follows: 

x(A) = / p{e\i)de (6) 

This covers all the prior mass corresponding to a parame- 
ter space with likelihood greater than A. This provides for 
the required transformation to one dimension. The Evidence 
integral then becomes: 

Evidence = / L{x)(^X (7) 
^0 

Because of the way we have ordered our elements in terms 
of Likelihood we can evaluate Li — L(xi) ctt a sequence of m 
points in the parameter space having decreasing Likelihood 
as shown in Figure [l] 

< Xm < Xm-1 < ... < X2 < Xl < 1 (8) 

If we set hi = Xi — Xi-i the above ordering allows us to 
evaluate the integral as a weighted sum of the Likelihood: 

m 

Evidence = ^ hi Li (9) 

Since we expect that the largest contributions to the integral 
to come from relatively small regions of peaked Posterior it 
makes more sense to sample points in x at a logarithmic 
rate instead of a linear one so that initial sampling of the 
shallow Posterior is rapid. We therefore take: 

Xl=^li X2=tlt2, X^=^1^2...t^ (10) 

where each of the t^, known as the shrinkage ratio, lies be- 
tween and 1. In practice the Nested Sampling algorithm 
implements these concepts as follows: 

(i) N objects are sampled randomly from within the prior 
and their likelihoods are evaluated. Initially we have the full 
prior range from to xo = 1 available to sample from. 



(ii) We then select the point with the lowest likelihood 
(Lo) and remove it from the set of samples. The prior volume 
is then shrunk to xi with the shrinkage ratio t determining 
the volume decrease. The value of t is the expectation value 
of the largest of N random numbers from uniform distribu- 
tion between and 1, which is given by N/{N + 1). 

(iii) The removed point is replaced with a new one satis- 
fying the hard constraint likelihood L > Lq 

(iv) We then increment the Evidence by Lq x (xi — Xo). 

(v) We iterate over these steps until we satisfy some stop- 
ping condition. 

The selection of the least likelihood object is illustrated in 
Fig. [2] The method thus works its way up the likelihood 
surface through nested surfaces of equal likelihood. To ter- 
minate the process we can use two conditions. The first is 
given in (Skilling, 2006) as the number of iterations k re- 
quired for samples to converge to posterior peaks: 

k-NH»0 (11) 

Secondly we also requires that successive changes in evidence 
are sufficiently small. 



3 VARIATIONAL BAYES 

Variational Bayesian methods provide another approach to 
parameter estimation and model selection. The basic con- 
cept behind this class of methods is to try and approximate 
the Posterior distribution with a simpler probability distri- 
bution. One can then optimize the parameters in this ap- 
proximation so that it is as close as possible to the true 
Posterior. It turns out that such an optimal form for the ap- 
proximating distribution does indeed exist and this is usu- 
ally very easy to deal with analytically. There is then no need 
to sample from the Posterior as important quantities such as 
the mean can be derived analytically from the approxima- 
tion. Similarly the Evidence can be worked using standard 
integration techniques on the same approximation. Varia- 
tional Bayesian methods thus reduce the computationally 
complex sampling and integration problems to a relatively 
easy optimization one. 

Though there are many optimization algorithms in the 
literature which are used to find the best approximation 
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4 COMPARISONS 

In this section we can compare Nested Sampling and Varia- 
tional Bayes as applied to two engineered problems, the first 
being Polynomial Selection and the second being the fitting 
of Gaussian Mixture Models. We use MATLAB in this pre- 
liminary study to speed up the coding process. Naturally 
both methods are coded in the same language to try and 
ensure a fairer comparison of the computational times. 



Figure 3. A diagram based on that in (Zarb Adami, 2003) show- 
ing the relationship between the KL divergence and the actual 
Evidence. Here the value of the Bound is given by the integral 
given in Equation \17\ . 



the most widely used one is the Expectation-Maximization 
(EM) algorithm (Zarb Adami, 2003). We give a brief 
overview of the method here, a more complete explanation 
can be found in (Mackay, 2003). The sensible question to ask 
when approximating a complex function by some simpler 
one Q{0) is the amount of information lost in this process. 
A suitable metric describing this quantity which one can 
use to quantify the "distance" between the original distri- 
bution and the approximation is the Kullback-Leibler (KL) 
divergence (Kullback, 1959), denoted by Dkl- This is given 
by: 



DkUQWP) 



log 



(_cm_ 



(12) 



We note that the above expression always returns a non- 
negative value and, as one would hope, vanishes when 
Q(6>) = P(6'|D,I). Bayes theorem then tells us that: 



^kl(QIIP) 



/ 



QWlog 



0(^)P(o|i) 



dO (13) 



^L(D|6',I)P(6'|D,I)^ 

The Evidence P(D|I) is a quantity which is independent of 
the parameters and can thus be taken out of the integral. 



Dk^(Q\\P) 



I 



Q{0) log 



L{^e,l)^T{e\l) 



+ logP(D|I) 



(14) 
(15) 



The above equation suggests that one could define a cost 
function Ckl(Q||P) which we can then seek to optimize by 
minimizing the Kullback-Liebler divergence Dkl: 



Ckl(QIIP) 



i?KL(Q||P) 



logP 

Q{0) 



(16) 
(17) 



^L(D|6»,I)7r(6»|I) 
Since Dyj_^{Q\\P) ^ the following inequality always holds 



Ckl(Q||P) > -logP 
-Ckl(QIIP) < logP(D|I) 



(18) 
(19) 



We thus see how minimizing the cost function is equivalent 
to maximizing a lower bound on the Evidence value. FigurejS] 
shows the relation between the lower bound to the Evidence 
and the Kullback-Leibler divergence. Fortunately enough, in 
most cases the lower bound to the Evidence is tight enough 
to be used in place of the true Evidence in model selec- 
tion procedures (Miskin, 2000). After having optimized the 
approximating distribution one can then perform inference 
with the approximation in place of the real Posterior. 



4.1 Polynomial Selection 

When one has a sparse data set it is often useful to infer a 
smooth curve that best fits with the observed points. This 
can facilitate further studies of the data as procedures like 
integration or the finding of extrema then become very easy. 
Of course one might also want to fit data which has been 
generated by a physical law which has polynomial depen- 
dence on some parameter. The task at hand is thus to select 
the degree of the polynomial to use as well as calculate its 
coefficients. If a low order polynomial is chosen then it might 
be difficult to fit the data well whilst if the polynomial order 
is too high then there is a risk of over-fitting. We denote our 
data set D, where D contains a set of points {(x^, D^)}. Here 
1 ^ i ^ I and / is the total number of data points. The Xi 
are the abscissa values whilst the Di are the ordinate values. 
In the general case one can then construct an interpolation 
model using a set of K fixed basis functions. We thus have: 



^ Wnfni + ei 



(20) 



Here f = fni is a matrix constructed using the interpolat- 
ing basis functions evaluated at the abscissa values xi. If 
we decide to choose polynomials then we get fni = Fn{xi) 
where Fn{x) — x^~^ . The value of in Equation (20) rep- 



resents the noise affecting each measurement, which we can 
assume to be Gaussian with inverse variance 7. If we assume 
that the Gaussian noise is independent then the Likelihood 
function becomes: 



L{P\W, 7, I) = ]^ G A I ^ Wnfm, 



(21) 



We generate the data for this study by computing the actual 
values of a polynomial within a predefined interval [a, b] and 
adding the Gaussian noise of inverse variance 7 using MAT- 
LAB's inbuilt functions. Unless otherwise stated all the plots 
shown in this section are based on a data set containing 40 
data points generated in the interval [—2,2] from a quint ic 
curve with added noise having cr = 7~^/^ = 2. We first tackle 
the situation using Variational Bayesian and then move on 
to use Nested Sampling. For the variational solution we can 
assign conjugate priors: 



P(w|I) = G{w\0,a^) 
P(7|I) = Gamma (7 1 a'^, 6^) 



(22) 
(23) 



We can set the three arbitrary parameters , and ac- 
cordingly to modify the shape and scale of the priors. These 
are chosen to make the priors as broad as possible, to simu- 
late our ignorance of the underlying phenomena. The update 
equations, given in (Zarb Adami, 2003) and (Miskin, 2000), 
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Figure 4. Plots showing the data set fitted with polynomials of 
different orders using Variational Bayes. 



are shown below 



W 



^(7) 



w 



I+(7)c 



ff^ 



7) 



,(7) 



It 



(24) 
(25) 

(26) 
(27) 



The iteration of these equations leads to the values for the 
optimal parameter set describing the data. For our 40 point 
data set we can try fitting different order polynomials. The 
results are shown in Figure ^ One can observe how ineffi- 
cient the fitting is when we use a quadratic and how this 
improves once we use a cubic curve. Fitting with a quint ic 
(n = 6) we obtain: 



= (0.35, 
= 1.65 



-0.82, -0.03, -0.29, -0.04, 1.19) 



(28) 
(29) 



These results are not too far from the true values of w = 
(0,0,0,0,0,1) and a = 2 and an element of inconsistency 
between the two is acceptable as we have taken a large noise 
factor and not so many data points (40). Note that the odd 
components in the weight vector, which correspond to the 
even polynomials x^^, are especially close to zero. This is 
because the parity of these components (even) is different 
from the parity of the mechanism generating the data (odd). 
Hence these contributions are fitted to zero as they can offer 
no explanation of the data. Referring to Figure [5] we notice 
that as expected the Likelihood always increases with the 
number of parameters used to describe the data. However 
the Evidence is also dependent on an Occam factor which 
penalizes more complex models and is thus peaked, begin- 
ning to decrease past n = 6. The peak at n = 6, correctly 
implies that the most probable hypothesis is that the data 
was generated using a 6 parameter polynomial which is a 
quint ic. 

To implement the Nested Sampling algorithm for this 
example we take Skilling's code (Skilling, 2004) as a skeleton 
and modify the Likelihood and exploration functions appro- 



Figure 5. Polynomial Selection using VB. One can see the Log- 
Likelihood L (top left), the Occam factor (top right) and the 
Evidence (bottom). 




Figure 6. Polynomial Selection using Nested Sampling. The bot- 
tom right shows the Evidence whilst the others are different poly- 
nomial fits. 



priately to represent Equation (21) . We set uniform priors 



for 7 and w accordingly. Exploration of the parameter space 
using a different variable step size for each separate param- 
eter. The algorithm was then run with 36 objects. Nested 
Sampling is much slower than the Variational method as the 
average time of computation was in the region of 300 sec- 
onds, an order of magnitude slower then Variational Bayes 
which takes around 20 seconds for a typical run. This differ- 
ence in speed is most probably due to the fact that Nested 
Sampling is an MCMC method, involving the probabilistic 
exploration of a parameter space whilst Variational Bayes is 
analytical and has no such probabilistic dependence. Plots 
showing the Evidence and fits are given in Figure [6] Note 
that within the Nested Sampling framework we have no ex- 
plicit "Occam Factor". Rather the increase in parameters 
penalizes the complex models by giving the algorithm more 
space to explore and thus longer time to reach regions of 
higher Likelihood. This means that when calculating the Ev- 
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idence the higher Likehhood values are multiphed by smaller 
weights as shown in Equation (|9| resulting in a lower Evi- 
dence value over all. Again we get the Evidence peaking at 
the correct value of n = 6 with the optimum Evidence for 
both methods at around log(Evidence) = —110. The coeffi- 
cients themselves are also in very good agreement with the 
ones obtained using Variational Bayes and we get: 

w = (0.35,-0.82,-0.03,-0.29,-0.04,1.19) (30) 
a = 1.65 (31) 

In fact the greatest variation is in the relatively inconsequen- 
tial coefficient of and the value of this difference is 18%. 
This is not such an issue as the coefficient is very small any- 
way and has no major effect on the final plots. The averaged 
percentage disagreement in all the other parameters is under 
1% and the percentage disagreement in the coefficient of 
is only 0.04%. This indicates clearly that the discrepancies 
from the true parameters are due to the random nature of 
the Gaussian noise and do not arise because of any fault in 
either of the algorithms. Note that we give all the results for 
the parameters in Appendix A, along with the percentage 
difference between the two methods. 



4.2 Gaussian Mixture Models 

A Mixture Model is basically a distribution built up using 
a number of simpler distributions having different parame- 
ters. They are often used to obtain or substitute more com- 
plex distributions. An illustrative example given in (Miskin, 
2000) concerns the size of fruit. The probability distribution 
will obviously depend on the type of fruit being measured. 
Instead of having a single complex distribution it would 
make sense to model the size distribution for each fruit type 
using a Gaussian and then have a categorical variable which 
gives the probability that a fruit is of a given type. Math- 
ematically this can be expressed in terms of the Likelihood 
function for a single data point Di, where i labels the data 
point and can take values 1 ^ i ^ /. If we have S categories 
then we have: 



Data generated using a GMM 



L{Di\e,l) = J2P{si = s\Tr,I)P{Di\9s,l) 



(32) 



Here Si is an indicator variable for each data point which 
tells us which distribution created the i^^ data point. These 
are chosen probabilistically such that P(s^ = s|7r,I) = tts. 
Also, P{Di\Os,T) is the Likelihood for a given Si and data 
point Di. We henceforth consider the particular case when 
the mixture is composed of Gaussians, known as a Gaussian 



Mixture Model (GMM). Equation (32) then becomes 



L(A|^,II) = ^7r.G(A|/is,a.) 



(33) 



The parameters /i = (/ii, /i2, Ms), cr — (ai, (72, crs) and 
TT = (7ri,7r2, ...,7rs) are collectively referred to as 0. 

A histogram of the generated points used in this sec- 
tion is shown in Figure [7| The data set contains 300 points 
generated from a Mixture Model composed of three Gaus- 
sians. These have means /i = ( — 1,1,3) and sigma a — 
(0.4, 0.3, 0.7) . The points were produced in the following 
ratio (0.3 : 0.35 : 0.35). The data set is relatively easy to 




Figure 7. Histogram showing the distribution composed of 
three Gaussians. These had means = (—1,1,3):, sigma a = 
(0.4, 0.3, 0.7) and were produced in the following ratio (0.3 : 0.35 : 
0.35). 



handle for the algorithm because there is only a slight over- 
lap between the different component distributions. 



4.2.1 Variational Bayesian Methods 

We first set out to attack the problem using Variational 
Bayesian methods, fitting the data with S Gaussians. We 
use conjugate Priors as required by the Variational Bayesian 
algorithm, using s to label the variables pertaining to each 
of the S Gaussians. 



1 



-|I) 

CTs 

P(7r|I) 



G{l3s\bos,cos) 
D{ns\Xos) 



(34) 
(35) 
(36) 



The first two equations are Gaussian Priors for the parame- 
ters pertaining to a single Gaussian in the mixture whilst the 
third, D, represents a Dirichlet distribution over the categor- 
ical weights with mixing hyperparameter Aos as described in 
Appendix A. If we assume the independence of the separate 
data points and we consider all the distributions forming the 
mixture, the Prior becomes: 

s s 
P{e\I) = P{n\l) II Pim H P{p4l) (37) 

Note that here fis, /J^s, bos , cos , tt and Aos each refer to a single 
one of the Gaussian distributions forming the mixture. The 
joint Likelihood of any single data point is given by Equation 



(38). 



L{D^,Si\e) = L(s = Si\n)G{D^\l3s,l^s) 



(38) 



If we assume that the data are independent then we con- 
struct the final Likelihood by taking the following product: 



L(D, s\e) = Yl L{si = s)G{Di\l3s,fis 



(39) 



Here D = (Di, D2, Dj) and s = (si, S2, s/). The in- 
clusion of the indicator variable introduces additional com- 
plexities into the derivation of the update equations for the 
Variational Bayes algorithm. The procedure is described in 
(Zarb Adami, 2003), (Miskin, 2000) and (Penny & Roberts, 
2000). Here it is the results from the latter approach that we 
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Figure 8. GMM Results using VB. We can see here three at- 
tempts to fit the data with different amounts of Gaussians as weh 
as a plot of the Evidence bound (top left). 



use and we refer the reader to this paper for the algorithmic 
details. The algorithm is stopped when successive changes in 
the Evidence bound are less than 10~^. The plots in Figurejs] 
show the results of fitting different numbers of Gaussians to 
the data. The fit having 3 Gaussians results in the following 
values for the means and the widths of the three component 
distributions: 



Figure 9. A plot of the log-Likelihood, the Occam factor and the 
Evidence for different numbers of Gaussians. 



^ = (-0.94,1.02,2.98) 
a = (0.42,0.33,0.68) 



(40) 
(41) 



These are in close agreement with the true values with which 
the data was generated. The deduced ratio of the three 
Gaussians is also correct, given by 



0.299 : 0.353 : 0.347 



(42) 



The plot of the evidence bound in Figure ^ peaks at the 
value of three, correctly indicating that most probably three 
Gaussians were used to construct the Mixture Model. The 
code takes around 25 seconds to run when testing for 1 to 6 
Gaussians. 



4-2.2 Nested Sampling 

We now apply the Nested Sampling algorithm to tackle 
the Gaussian Mixture Model data. As a stopping condition 
we impose that the difference between successive values in 
log Evidence should be less than 10~^ and that the condi- 



tion in Equation (11) is satisfied. In the results quoted here 



we utilize 50 objects in order to ensure a decent algorithmic 
speed. Once again the Nested Sampling implementation is 
considerably slower, with the code taking approximately 8 
minutes to try out 1 to 6 Gaussians. Again, as with polyno- 
mial fitting, we can attribute the difference in computational 
speed to the random nature of the Nested Sampling algo- 
rithm. We note here that for both polynomial fitting as well 
as Gaussian Mixture Models the Nested Sampling imple- 
mentation is our own and hence further work might be able 
to improve the computational speed. However, we do not 
believe that any increase in speed for Nested Sampling will 
be able to improve the computational time to Variational 




Figure 10. GMM Evidence Plot using Nested Samphng (right) 
and fitted GMM (left). 



Bayes levels. In Figure [To] we show the Evidence against the 
number of Gaussians used for fitting as well as the optimal 
fit using three Gaussians. We first note that the figure gives 
a peak at the correct value of 3, indicating that 3 Gaus- 
sians were used to generate the data set. For the optimal 
fit the three Gaussians have means /x = (—0.97, 1.02,2.95), 
sigmas a = (0.40, 0.35, 0.70) and are produced in the ratio 
(0.292 : 0.354 : 0.354). These are very close to the true values 
which were used to generate the data set and in particular we 
note that the ratios are calculated extremely well. Differing 
runs of the Nested Sampling algorithm, particularly when we 
run with more stringent stopping conditions, result in pa- 
rameter values which differ by 0.03 at most. For a run with 
a stopping condition of 10~^ we get /i = (—0.95, 1.01, 2.93), 
a = (0.42,0.32,0.73) and ratio = (0.291 : 0.354 : 0.354) . 

We also test both algorithms using a harder data set 
where there is far more overlap in the Gaussians. To per- 
form this analysis we generate 600 data points with means 
/J = ( — 1,0,1), leaving the other parameters as before. 
The results are again favourable. In particular the Evidence 
bound peaks at the correct value of 3. Also the inferred pa- 
rameters of the 3 fitted Gaussians are largely in line with 
the true values: 



^ = (-0.96,0.04,1.06) 
a = (0.39,0.28,0.70) 
The ratios are given by: 

0.326 : 0.342 : 0.332 



(43) 
(44) 

(45) 
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Figure 11. GMM Results using Variational Bayes with a hard 
data set. On the left we see the optimal fit using three Gaussians. 
On the right we have a plot of the Evidence against the number 
of Gaussians used. 




^1 



Figure 12. GMM Results using Variational Bayes with a hard 
data set. On the left we see the optimal fit using three Gaussians. 
On the right we have a plot of the Evidence against the number 
of Gaussians used. 

The largest discrepancy is in the ratio values. This proba- 
bly stems from fact that because the Gaussians are so close 
to each other it is harder to determine from which Gaus- 
sian each point has been generated. These results are shown 
in Figure |11| Investigating the same data set with Nested 
Sampling we get the following results. The Gaussian posi- 
tions returned by the algorithm are given by: 

^ = (-0.98,0.03,1.02) (46) 
a = (0.39,0.29,0.70) (47) 
ratios = (0.331 : 0.340 : 0.329) (48) 

Nested sampling produces slightly better values for the 
Gaussian means and widths but the ratios are again incor- 
rect. The values are however within one standard deviation 
of the true results. Again there is good agreement between 
the Variational Bayes and Nested Sampling methods and 
differences in the results from the true values are reflected 
in both techniques. This indicates that much of the imper- 
fections in the inference are due to the random nature of the 
generated data and not due to the algorithmic deficiencies. 



a few percent of each other and the peaks in the Evidence 
plots always agree. The discrepancy in speeds was attributed 
to the fact that Nested Sampling is an MCMC method and 
is dependent on random exploration of a parameter space 
while Variational Bayes is analytical and has no such proba- 
bilistic element. We do however point out that it is far easy 
to formulate a Bayesian solution to a problem using Nested 
Sampling than using Variational Bayes as in the latter case 
one must derive the update equations for each problem and 
this might be a non-trivial process. 
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5 CONCLUSIONS 

Virtually all experimental physical analysis done today is 
somewhat statistical in nature and therefore necessitates the 
use of powerful computational tools. The Bayesian frame- 
work is one such set of tools which is gaining popularity 
within the scientific community. Our main conclusion is that 
whilst both Variational Bayes and Nested Sampling give 
similar and accurate results when tackling engineered prob- 
lems the former is much faster than the latter. Inferred pa- 
rameters using the two different techniques are often within 



Appendix A: Notation 

We describe the distributions used throughout this paper as 
well as their notation. 
Gaussian Distribution 

The Gaussian (or Normal) distribution is given by the Equa- 
tion [2] where we consider all previous knowledge to be en- 
coded in I 

P{x\l) = G{x\x,x) (1) 
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Here x is the mean and x > is the inverse variance. Hence 
the standard deviation, a given by the square root of the 
variance can be expressed as follows: 



variance ■ 



(3) 



Useful expectation values under the Gaussian distribution 
are: 



(4) 



Multivariate Gaussian Distribution 

The Gaussian can be generalised to d dimensions, in which 
case it is called a Multivariate Gaussian: 



P(x\I) 



G{x\x, x) 



(27r)^ 



exp 



(5) 
(6) 



Here x is a vector containing the mean in each dimension and 
X > is the symmetric and positive definite inverse covari- 
ance matrix. This is necessarily a symmetric matrix because 
the cross terms are related to the covariances of the variables 
and the covariance relation is necessarily a symmetric one. 
When the variables are independent the covariance matrix, 
and hence the inverse covariance matrix, is of diagonal form. 
Useful expectation values under the multivariate Gaussian 
distribution are: 

<x>p—x < x"^ > p— x^ + x~^ (7) 

Gamma Distribution The Gamma distribution crops up 
in the Bayesian framework because it is the conjugate dis- 
tribution for the inverse variance of a Gaussian distribution. 
The distribution itself is given by: 



P{x\l) 



Gamma(cc|a, h) 
1 

m 



b (b-1) / \ 

a ' exp(— ax) 



(8) 
(9) 



The constant a > is a scale variable and 6 > dictates the 
shape of the distribution. Some useful expectation values 



The expectation value of tt^ is given by: 

A. 



< TTi >p = 



(14) 



This paper has been typeset from a TgX/ lOT^X file prepared 
by the author using the Blackwell Science MN journal doc- 
ument class file. 



<X>P 



< log X > p - — log a + 



a log r (6) 

dh 



(10) 
(11) 



The derivative '^^^ is known as the digamma function 

and is often denoted by ^(6). 

Dirichlet Distribution The Dirichlet distribution is used 
to model categorical weights, particularly in the Gaussian 
Mixture Model setting. The probability distribution func- 
tion for a Dirichlet describing m weights is given by: 



^ ^ nr=ir(A.)iJ"^ 



(12) 



The parameter Ao is called the mixing hyperparameter. The 
word hyperparameter is often used in Bayesian statistics to 
denote parameters in priors and to distinguish them from 
the actual parameters of the underlying model under inves- 
tigation. When the As are all equal we get the symmetric 
Dirichlet distribution: 



r(mAo) 

r'"(A,) 



(13) 
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